

<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
  <meta charset="utf-8" />
  
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  
  <title>mindspore.dataset.text.Vocab &mdash; MindSpore master documentation</title>
  

  
  <link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" />
  <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />

  
  

  
  

  

  
  <!--[if lt IE 9]>
    <script src="../../_static/js/html5shiv.min.js"></script>
  <![endif]-->
  
    
      <script type="text/javascript" id="documentation_options" data-url_root="../../" src="../../_static/documentation_options.js"></script>
        <script src="../../_static/jquery.js"></script>
        <script src="../../_static/underscore.js"></script>
        <script src="../../_static/doctools.js"></script>
        <script src="../../_static/language_data.js"></script>
        <script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
    
    <script type="text/javascript" src="../../_static/js/theme.js"></script>

    
    <link rel="index" title="Index" href="../../genindex.html" />
    <link rel="search" title="Search" href="../../search.html" />
    <link rel="next" title="mindspore.dataset.transforms" href="../mindspore.dataset.transforms.html" />
    <link rel="prev" title="mindspore.dataset.text.to_bytes" href="mindspore.dataset.text.to_bytes.html" /> 
</head>

<body class="wy-body-for-nav">

   
  <div class="wy-grid-for-nav">
    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search" >
          

          
            <a href="../../index.html" class="icon icon-home"> MindSpore
          

          
          </a>

          
            
            
          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
    <input type="text" name="q" placeholder="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        
        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
              
            
            
              <p class="caption"><span class="caption-text">MindSpore Python API</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../mindspore.html">mindspore</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.common.initializer.html">mindspore.common.initializer</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.communication.html">mindspore.communication</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.compression.html">mindspore.compression</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.context.html">mindspore.context</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.dataset.html">mindspore.dataset</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.dataset.audio.html">mindspore.dataset.audio</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.dataset.config.html">mindspore.dataset.config</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../mindspore.dataset.text.html">mindspore.dataset.text</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../mindspore.dataset.text.html#mindspore-dataset-text-transforms">mindspore.dataset.text.transforms</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="../mindspore.dataset.text.html#mindspore-dataset-text-utils">mindspore.dataset.text.utils</a><ul class="current">
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.JiebaMode.html">mindspore.dataset.text.JiebaMode</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.NormalizeForm.html">mindspore.dataset.text.NormalizeForm</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.SentencePieceModel.html">mindspore.dataset.text.SentencePieceModel</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.SentencePieceVocab.html">mindspore.dataset.text.SentencePieceVocab</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.SPieceTokenizerLoadType.html">mindspore.dataset.text.SPieceTokenizerLoadType</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.SPieceTokenizerOutType.html">mindspore.dataset.text.SPieceTokenizerOutType</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.to_str.html">mindspore.dataset.text.to_str</a></li>
<li class="toctree-l3"><a class="reference internal" href="mindspore.dataset.text.to_bytes.html">mindspore.dataset.text.to_bytes</a></li>
<li class="toctree-l3 current"><a class="current reference internal" href="#">mindspore.dataset.text.Vocab</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.dataset.transforms.html">mindspore.dataset.transforms</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.dataset.vision.html">mindspore.dataset.vision</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.mindrecord.html">mindspore.mindrecord</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.nn.html">mindspore.nn</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.nn.probability.html">mindspore.nn.probability</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.nn.transformer.html">mindspore.nn.transformer</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.numpy.html">mindspore.numpy</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.ops.html">mindspore.ops</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.parallel.html">mindspore.parallel</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.parallel.nn.html">mindspore.parallel.nn</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.profiler.html">mindspore.profiler</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.scipy.html">mindspore.scipy</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.train.html">mindspore.train</a></li>
<li class="toctree-l1"><a class="reference internal" href="../mindspore.boost.html">mindspore.boost</a></li>
</ul>
<p class="caption"><span class="caption-text">MindSpore C++ API</span></p>
<ul>
<li class="toctree-l1"><a class="reference external" href="https://www.mindspore.cn/lite/api/zh-CN/master/api_cpp/mindspore.html">MindSpore Lite↗</a></li>
</ul>

            
          
        </div>
        
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" aria-label="top navigation">
        
          <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
          <a href="../../index.html">MindSpore</a>
        
      </nav>


      <div class="wy-nav-content">
        
        <div class="rst-content">
        
          

















<div role="navigation" aria-label="breadcrumbs navigation">

  <ul class="wy-breadcrumbs">
    
      <li><a href="../../index.html" class="icon icon-home"></a> &raquo;</li>
        
          <li><a href="../mindspore.dataset.text.html">mindspore.dataset.text</a> &raquo;</li>
        
      <li>mindspore.dataset.text.Vocab</li>
    
    
      <li class="wy-breadcrumbs-aside">
        
          
            <a href="../../_sources/api_python/dataset_text/mindspore.dataset.text.Vocab.rst.txt" rel="nofollow"> View page source</a>
          
        
      </li>
    
  </ul>

  
  <hr/>
</div>
          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
  <div class="section" id="mindspore-dataset-text-vocab">
<h1>mindspore.dataset.text.Vocab<a class="headerlink" href="#mindspore-dataset-text-vocab" title="Permalink to this headline">¶</a></h1>
<dl class="class">
<dt id="mindspore.dataset.text.Vocab">
<em class="property">class </em><code class="sig-prename descclassname">mindspore.dataset.text.</code><code class="sig-name descname">Vocab</code><a class="headerlink" href="#mindspore.dataset.text.Vocab" title="Permalink to this definition">¶</a></dt>
<dd><p>Vocab object used to look up words.</p>
<p>It contains a mapping from each word (str) to an ID (int).</p>
<dl class="method">
<dt id="mindspore.dataset.text.Vocab.from_dataset">
<code class="sig-name descname">from_dataset</code><span class="sig-paren">(</span><em class="sig-param">dataset</em>, <em class="sig-param">columns=None</em>, <em class="sig-param">freq_range=None</em>, <em class="sig-param">top_k=None</em>, <em class="sig-param">special_tokens=None</em>, <em class="sig-param">special_first=True</em><span class="sig-paren">)</span><a class="headerlink" href="#mindspore.dataset.text.Vocab.from_dataset" title="Permalink to this definition">¶</a></dt>
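<dd><p>Build a vocab object from a dataset.</p>
<p>This collects all unique words in the dataset and returns a vocab restricted to the frequency range the user specifies in <cite>freq_range</cite>. A warning is issued if no word falls within that range.
Words in the vocab are ordered from highest to lowest frequency. Words with the same frequency are ordered lexicographically.</p>
<p><strong>Parameters:</strong></p>
<ul class="simple">
<li><p><strong>dataset</strong> (Dataset) - The dataset to build the vocab from.</p></li>
<li><p><strong>columns</strong> (list[str], optional) - Column names to get words from. It can be a list of column names. Default: None. An error is returned if any column is not of string type.</p></li>
<li><p><strong>freq_range</strong> (tuple, optional) - A tuple of integers (min_frequency, max_frequency). Words within the frequency range are kept. 0 &lt;= min_frequency &lt;= max_frequency &lt;= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency &gt; total_words is the same as max_frequency = total_words. min_frequency and max_frequency can be None, which corresponds to 0 and total_words respectively. Default: None.</p></li>
<li><p><strong>top_k</strong> (int, optional) - Number of words to build into the vocab; the <cite>top_k</cite> most frequent words are taken. <cite>top_k</cite> must be greater than 0 and is applied after <cite>freq_range</cite>. If fewer than <cite>top_k</cite> words are available, all words are taken. Default: None.</p></li>
<li><p><strong>special_tokens</strong> (list, optional) - A list of strings, each of which is a special token, for example special_tokens=["&lt;pad&gt;", "&lt;unk&gt;"]. Default: None, no special tokens are added.</p></li>
<li><p><strong>special_first</strong> (bool, optional) - Whether <cite>special_tokens</cite> are prepended or appended to the vocab. If <cite>special_tokens</cite> is specified and <cite>special_first</cite> is True, <cite>special_tokens</cite> are prepended. Default: True.</p></li>
</ul>
<p><strong>Returns:</strong>
Vocab, the vocab object built from the dataset.</p>
<p><strong>Examples:</strong></p>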
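<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">mindspore.dataset</span> <span class="k">as</span> <span class="nn">ds</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">mindspore.dataset.text</span> <span class="k">as</span> <span class="nn">text</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">TextFileDataset</span><span class="p">(</span><span class="s2">&quot;/path/to/sentence/piece/vocab/file&quot;</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>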
<span class="gp">&gt;&gt;&gt; </span><span class="n">vocab</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">Vocab</span><span class="o">.</span><span class="n">from_dataset</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="s2">&quot;text&quot;</span><span class="p">,</span> <span class="n">freq_range</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="gp">... </span>                                <span class="n">special_tokens</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;&lt;pad&gt;&quot;</span><span class="p">,</span> <span class="s2">&quot;&lt;unk&gt;&quot;</span><span class="p">],</span>
<span class="gp">... </span>                                <span class="n">special_first</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">operations</span><span class="o">=</span><span class="n">text</span><span class="o">.</span><span class="n">Lookup</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="s2">&quot;&lt;unk&gt;&quot;</span><span class="p">),</span> <span class="n">input_columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;text&quot;</span><span class="p">])</span>
</pre></div>
</div>
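<p>The filtering and ordering rules above can be approximated in plain Python. The following is only a sketch and not MindSpore's implementation: the helper name <code>build_vocab_sketch</code>, the whitespace tokenization, the assignment of consecutive IDs starting at 0, and the bracketed placeholder tokens <code>[PAD]</code> and <code>[UNK]</code> are all assumptions made for illustration.</p>

```python
from collections import Counter


def build_vocab_sketch(sentences, freq_range=None, top_k=None,
                       special_tokens=None, special_first=True):
    """Pure-Python approximation of the documented ordering rules.

    Hypothetical helper, not part of MindSpore. Whitespace tokenization
    and consecutive IDs starting at 0 are assumptions.
    """
    counts = Counter(word for line in sentences for word in line.split())
    total_words = sum(counts.values())

    # None bounds fall back to the widest possible range.
    min_freq, max_freq = freq_range if freq_range is not None else (None, None)
    min_freq = 0 if min_freq is None else min_freq
    max_freq = total_words if max_freq is None else max_freq

    kept = [(w, c) for w, c in counts.items() if max_freq >= c >= min_freq]
    # Highest frequency first; ties are broken lexicographically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    words = [w for w, _ in kept]
    if top_k is not None:
        # top_k is applied after the frequency filter.
        words = words[:top_k]

    specials = list(special_tokens or [])
    ordered = specials + words if special_first else words + specials
    return {word: idx for idx, word in enumerate(ordered)}


corpus = ["the cat sat", "the dog sat", "the end"]
vocab = build_vocab_sketch(corpus, freq_range=(1, None), top_k=3,
                           special_tokens=["[PAD]", "[UNK]"])
print(vocab)
```

<p>In this toy corpus "the" occurs three times and "sat" twice, so they come first; "cat", "dog" and "end" tie at one occurrence, and "cat" wins the last <cite>top_k</cite> slot because it sorts first lexicographically.</p>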
</dd></dl>

<dl class="method">
<dt id="mindspore.dataset.text.Vocab.from_dict">
<code class="sig-name descname">from_dict</code><span class="sig-paren">(</span><em class="sig-param">word_dict</em><span class="sig-paren">)</span><a class="headerlink" href="#mindspore.dataset.text.Vocab.from_dict" title="Permalink to this definition">¶</a></dt>
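<dd><p>Build a vocab object from a dict.</p>
<p><strong>Parameters:</strong></p>
<ul class="simple">
<li><p><strong>word_dict</strong> (dict) - A dict of word and ID pairs, where each <cite>word</cite> should be of string type and each <cite>ID</cite> of int type. It is recommended that the IDs start from 0 and are continuous. A ValueError is raised if an <cite>ID</cite> is negative.</p></li>
</ul>
<p><strong>Returns:</strong>
Vocab, the vocab object built from the <cite>dict</cite>.</p>
<p><strong>Examples:</strong></p>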
<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">mindspore.dataset.text</span> <span class="k">as</span> <span class="nn">text</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">vocab</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">Vocab</span><span class="o">.</span><span class="n">from_dict</span><span class="p">({</span><span class="s2">&quot;home&quot;</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">&quot;behind&quot;</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">&quot;the&quot;</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span> <span class="s2">&quot;world&quot;</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="s2">&quot;&lt;unk&gt;&quot;</span><span class="p">:</span> <span class="mi">6</span><span class="p">})</span>
</pre></div>
</div>
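<p>Before passing a dict to this method, it may help to check it against the constraints listed above. The sketch below is plain Python and not part of MindSpore; the helper names are hypothetical, and raising TypeError for wrongly typed keys or values is an assumption (only the ValueError for negative IDs is stated in the description).</p>

```python
def check_word_dict(word_dict):
    """Validate a word-to-ID dict against the documented constraints (a sketch)."""
    for word, idx in word_dict.items():
        if not isinstance(word, str):
            raise TypeError(f"word must be str, got {type(word).__name__}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(idx, bool) or not isinstance(idx, int):
            raise TypeError(f"ID must be int, got {type(idx).__name__}")
        if 0 > idx:
            raise ValueError(f"ID for {word!r} must not be negative: {idx}")
    return dict(word_dict)


def is_contiguous_from_zero(word_dict):
    """Return True when the IDs are exactly 0, 1, ..., len - 1."""
    return sorted(word_dict.values()) == list(range(len(word_dict)))


word_dict = check_word_dict({"home": 3, "behind": 2, "the": 4, "world": 5})
print(is_contiguous_from_zero(word_dict))  # IDs start at 2 here, so this prints False
```

<p>A dict whose IDs do not start at 0 still passes the validity check; the contiguity test only flags that it departs from the recommendation.</p>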
</dd></dl>

<dl class="method">
<dt id="mindspore.dataset.text.Vocab.from_file">
<code class="sig-name descname">from_file</code><span class="sig-paren">(</span><em class="sig-param">file_path</em>, <em class="sig-param">delimiter=''</em>, <em class="sig-param">vocab_size=None</em>, <em class="sig-param">special_tokens=None</em>, <em class="sig-param">special_first=True</em><span class="sig-paren">)</span><a class="headerlink" href="#mindspore.dataset.text.Vocab.from_file" title="Permalink to this definition">¶</a></dt>
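<dd><p>Build a vocab object from a file containing a list of words.</p>
<p><strong>Parameters:</strong></p>
<ul class="simple">
<li><p><strong>file_path</strong> (str) - Path to the file that contains the vocab list.</p></li>
<li><p><strong>delimiter</strong> (str, optional) - Delimiter used to split each line of the file; the first element is taken to be the word. Default: "".</p></li>
<li><p><strong>vocab_size</strong> (int, optional) - Number of words to read from <cite>file_path</cite>. Default: None, all words are read.</p></li>
<li><p><strong>special_tokens</strong> (list, optional) - A list of strings, each of which is a special token, for example special_tokens=["&lt;pad&gt;", "&lt;unk&gt;"]. Default: None, no special tokens are added.</p></li>
<li><p><strong>special_first</strong> (bool, optional) - Whether <cite>special_tokens</cite> are prepended or appended to the vocab. If <cite>special_tokens</cite> is specified and <cite>special_first</cite> is True, <cite>special_tokens</cite> are prepended. Default: True.</p></li>
</ul>
<p><strong>Returns:</strong>
Vocab, the vocab object built from the file.</p>
<p><strong>Examples:</strong></p>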
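<div class="doctest highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">mindspore.dataset.text</span> <span class="k">as</span> <span class="nn">text</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># Assume vocab file contains the following content:</span>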
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># --- begin of file ---</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># apple,apple2</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># banana, 333</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># cat,00</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># --- end of file ---</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># Read file through this API and specify &quot;,&quot; as delimiter.</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># The delimiter will break up each line in file, then the first element is taken to be the word.</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">vocab</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">Vocab</span><span class="o">.</span><span class="n">from_file</span><span class="p">(</span><span class="s2">&quot;/path/to/simple/vocab/file&quot;</span><span class="p">,</span> <span class="s2">&quot;,&quot;</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="p">[</span><span class="s2">&quot;&lt;pad&gt;&quot;</span><span class="p">,</span> <span class="s2">&quot;&lt;unk&gt;&quot;</span><span class="p">],</span> <span class="kc">True</span><span class="p">)</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c1"># Finally, there are 5 words in the vocab: &quot;&lt;pad&gt;&quot;, &quot;&lt;unk&gt;&quot;, &quot;apple&quot;, &quot;banana&quot;, &quot;cat&quot;.</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">vocabulary</span> <span class="o">=</span> <span class="n">vocab</span><span class="o">.</span><span class="n">vocab</span><span class="p">()</span>
</pre></div>
</div>
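<p>The way the delimiter and <cite>vocab_size</cite> act on each line can be illustrated without MindSpore. This is a rough sketch under stated assumptions: <code>read_vocab_words</code> is a hypothetical helper, skipping blank lines is an assumption, and the bracketed placeholder tokens stand in for real special tokens.</p>

```python
import os
import tempfile


def read_vocab_words(file_path, delimiter="", vocab_size=None):
    """Collect the first field of every line (a sketch, not MindSpore code)."""
    words = []
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            # Stop once the requested number of words has been read.
            if vocab_size is not None and len(words) >= vocab_size:
                break
            line = line.rstrip("\n")
            if not line:
                continue  # assumption: blank lines carry no word
            # An empty delimiter means the whole line is the word.
            word = line.split(delimiter)[0] if delimiter else line
            words.append(word)
    return words


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("apple,apple2\nbanana, 333\ncat,00\n")
    words = read_vocab_words(path, delimiter=",")
    first_two = read_vocab_words(path, delimiter=",", vocab_size=2)

# Prepend the special tokens, mirroring special_first=True.
specials = ["[PAD]", "[UNK]"]
vocab = {word: idx for idx, word in enumerate(specials + words)}
print(vocab)
```

<p>With the three-line file from the example above, only "apple", "banana" and "cat" survive the split, and the two special tokens take IDs 0 and 1.</p>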
</dd></dl>

</dd></dl>

<dl class="method">
<dt id="from_list">
<code class="sig-name descname">from_list</code><span class="sig-paren">(</span><em class="sig-param">word_list</em>, <em class="sig-param">special_tokens=None</em>, <em class="sig-param">special_first=True</em><span class="sig-paren">)</span><a class="headerlink" href="#from_list" title="Permalink to this definition">¶</a></dt>
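<dd><p>Build a vocab object from a list of words.</p>
<p><strong>Parameters:</strong></p>
<ul class="simple">
<li><p><strong>word_list</strong> (list) - A list of strings, where each element is a word of string type.</p></li>
<li><p><strong>special_tokens</strong> (list, optional) - A list of strings, each of which is a special token, for example special_tokens=["&lt;pad&gt;", "&lt;unk&gt;"]. Default: None, no special tokens are added.</p></li>
<li><p><strong>special_first</strong> (bool, optional) - Whether <cite>special_tokens</cite> are prepended or appended to the vocab. If <cite>special_tokens</cite> is specified and <cite>special_first</cite> is True, <cite>special_tokens</cite> are prepended. Default: True.</p></li>
</ul>
<p><strong>Returns:</strong>
Vocab, the vocab object built from the <cite>list</cite>.</p>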
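<p>The effect of <cite>special_first</cite> on the resulting IDs can be sketched in plain Python. The helper below is hypothetical and not MindSpore's implementation; assigning consecutive IDs in list order and rejecting duplicates are assumptions, as is the bracketed placeholder token.</p>

```python
def vocab_from_list_sketch(word_list, special_tokens=None, special_first=True):
    """Assign consecutive IDs in list order (illustrative, not MindSpore code)."""
    specials = list(special_tokens or [])
    words = list(word_list)
    ordered = specials + words if special_first else words + specials
    if len(set(ordered)) != len(ordered):
        # Assumption: duplicates are rejected so every word gets exactly one ID.
        raise ValueError("words and special tokens must be unique")
    return {word: idx for idx, word in enumerate(ordered)}


words = ["w1", "w2", "w3"]
front = vocab_from_list_sketch(words, special_tokens=["[UNK]"], special_first=True)
back = vocab_from_list_sketch(words, special_tokens=["[UNK]"], special_first=False)
print(front)
print(back)
```

<p>In this sketch, prepending gives the special token ID 0 and shifts every word up by one, while appending leaves the word IDs unchanged and places the special token last.</p>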
</dd></dl>

</div>


           </div>
           
          </div>
          <footer>
    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
        <a href="../mindspore.dataset.transforms.html" class="btn btn-neutral float-right" title="mindspore.dataset.transforms" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
        <a href="mindspore.dataset.text.to_bytes.html" class="btn btn-neutral float-left" title="mindspore.dataset.text.to_bytes" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
    </div>

  <hr/>

  <div role="contentinfo">
    <p>
        &#169; Copyright 2021, MindSpore.

    </p>
  </div>
    
    
    
    Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
    
    <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
    
    provided by <a href="https://readthedocs.org">Read the Docs</a>. 

</footer>
        </div>
      </div>

    </section>

  </div>
  

  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.Navigation.enable(true);
      });
  </script>

  
  
    
   

</body>
</html>